unit 3
Overview of Text Classification Tasks
1. What is Text Classification
Text Classification is the process of automatically assigning text documents to predefined categories.
In simple words, it means a computer reads text and decides which group it belongs to.
Example:
Email → Spam / Not Spam
Review → Positive / Negative
2. Why Text Classification is Needed
Today there is a huge amount of digital text such as emails, reviews, messages, and articles.
It is not possible for humans to manually classify all this data.
So, machines are used to classify text automatically.
3. Basic Example
Message: "Win money now"
Category: Spam
Message: "Meeting at 5 pm"
Category: Not Spam
The computer learns from such examples and classifies new messages.
4. Steps in Text Classification
Step 1: Data Collection
Collect labeled text data.
Example: Spam and non-spam emails.
Step 2: Text Preprocessing
Clean the text.
Includes:
Removing symbols
Removing extra spaces
Converting to lowercase
Removing common words
Example:
"I Love AI!!!" → "love ai"
Step 3: Feature Extraction
Convert text into numbers.
Methods:
Bag of Words
TF-IDF
Word Embeddings
Step 4: Model Training
Train machine learning algorithms.
Examples:
Naive Bayes
Logistic Regression
SVM
Neural Networks
Step 5: Prediction
New text is given to the model.
The model predicts the category.
5. Common Algorithms Used
Naive Bayes: Fast and simple, good for spam detection
Logistic Regression: Easy to understand, good accuracy
SVM: High accuracy for complex data
6. Applications of Text Classification
Used in:
Email filtering
Product reviews
News categorization
Chatbots
Search engines
Social media monitoring
Naive Bayes for Text Classification
Basic Idea
Naive Bayes is a probability-based classifier.
It uses past data to calculate the probability of a document belonging to a class.
It assumes that all words in a document are independent of each other.
This assumption is called the naive assumption.
Bayes Formula (Using a and b)
Let:
a = Class (Spam, Positive, etc.)
b = Document
Main formula:
P(a | b) = ( P(b | a) × P(a) ) / P(b)
Meaning:
Probability of class a when document b is given.
In practice, P(b) is same for all classes, so we use:
P(a | b) ∝ P(b | a) × P(a)
Word Independence Formula
If document b has words w1, w2, w3:
P(b | a) = P(w1 | a) × P(w2 | a) × P(w3 | a)
Final formula:
P(a | b) ∝ P(a) × ∏ P(wi | a)
Decision Rule
The class with highest probability is selected:
Class = max [ P(a) × ∏ P(wi | a) ]
Advantages and Limitations
Advantages:
-
Very fast
-
Works well for text
-
Easy to implement
Limitations:
-
Assumes words are independent
-
Zero probability problem
-
Lower accuracy for complex data
Support Vector Machine (SVM) for Text Classification
Basic Idea
SVM is a margin-based classifier.
It draws a boundary between two classes such that the distance from both classes is maximum.
This distance is called margin.
Only the closest points decide the boundary.
These points are called support vectors.
Hyperplane Formula
The separating line or plane is:
w · x + b = 0
Where:
-
w = weight vector
-
x = document vector
-
b = bias
Classification Formula
For a new document x:
f(x) = sign(w · x + b)
If value is positive → Class 1
If value is negative → Class 0
Optimization Formula
SVM tries to minimize:
(1/2) ||w||²
With condition:
y (w · x + b) ≥ 1
For noisy data:
(1/2)||w||² + C Σ ξ
Advantages and Limitations
Advantages:
-
High accuracy
-
Works well with high-dimensional text
-
Effective with TF-IDF
Limitations:
-
Slow for large datasets
-
Hard to choose kernel
-
Less interpretable
Logistic Regression for Text Classification
Basic Idea
Logistic Regression is a probability-based linear classifier.
It first finds probability, then converts it into class.
It is mainly used for binary classification.
Linear Formula
First, a linear value is calculated:
z = w · x + b
Sigmoid Formula
Sigmoid converts z into probability:
P = 1 / (1 + e⁻ᶻ)
So final formula is:
P = 1 / (1 + e⁻(w·x + b))
Decision Rule
If P ≥ 0.5 → Class = 1
If P < 0.5 → Class = 0
Loss Formula
Logistic Regression uses log loss:
L = −[ y log(p) + (1 − y) log(1 − p) ]
Advantages and Limitations
Advantages:
-
Simple and easy to understand
-
Gives probability output
-
Works well for sparse data
Limitations:
-
Only linear boundaries
-
Not good for complex patterns
-
Sensitive to outliers
Comparison of Three Algorithms
Key Differences
| Feature | Naive Bayes | SVM | Logistic Regression |
|---|---|---|---|
| Method | Probability | Margin | Probability |
| Speed | Very Fast | Medium | Fast |
| Accuracy | Medium | High | High |
| Boundary | Simple | Complex | Linear |
| Output | Class | Class | Probability |
Evaluation Metrics for Classification
In classification tasks, we measure how well a model is performing using specific metrics. These metrics help assess the quality of predictions and compare models. (Google for Developers)
Confusion Matrix
Before defining metrics, the confusion matrix is fundamental. It shows counts of correct and incorrect predictions. (GeeksforGeeks)
The matrix has four components for binary classification:
-
TP (True Positive): Model predicts positive and it is actually positive
-
TN (True Negative): Model predicts negative and it is actually negative
-
FP (False Positive): Model predicts positive but it is actually negative
-
FN (False Negative): Model predicts negative but it is actually positive
Accuracy
Accuracy measures overall correctness of the model. It is the proportion of all correct predictions. (Wikipedia)
Formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is intuitive but can be misleading when classes are imbalanced. (Wikipedia)
Precision
Precision measures how many of the positive predictions are actually correct. It focuses on positive predictions. (Wikipedia)
Formula:
Precision = TP / (TP + FP)
High precision means the model’s positive predictions are reliable.
Recall
Recall measures how many of the actual positive cases the model correctly identifies. It focuses on true positive cases. (Wikipedia)
Formula:
Recall = TP / (TP + FN)
High recall means the model finds most of the positive cases.
F1 Score
F1 score balances precision and recall into a single metric. It is useful when you want a trade-off between these two metrics, especially in imbalanced datasets. (Google for Developers)
Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 score ranges from 0 to 1. Higher values mean better performance.
When to Use Which Metric
-
Accuracy is useful when classes are balanced and errors cost the same. (Google for Developers)
-
Precision is important when false positives are costly (e.g., spam detection). (Google for Developers)
-
Recall is important when missing positive cases is costly (e.g., disease detection). (Google for Developers)
-
F1 Score is preferred when you want a balance between precision and recall, especially with imbalanced datasets. (Google for Developers)
Multi-Class Classification
For more than two classes, precision and recall can be averaged across classes using macro or micro averaging. (evidentlyai.com)
-
Macro-averaging: Average metric across all classes equally
-
Micro-averaging: Average metric weighted by class frequency
Summary of Formulas
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Introduction to Topic Modeling
What is Topic Modeling
Topic Modeling is a technique used to automatically discover hidden topics in a collection of documents.
In simple words:
It helps a computer find
“What is this document mainly about?”
without reading it manually.
Example:
From many articles, the model may find topics like:
-
Sports
-
Politics
-
Technology
-
Health
Why Topic Modeling is Needed
Large amounts of text are generated every day:
-
News
-
Blogs
-
Reviews
-
Research papers
Reading all manually is not possible.
So topic modeling helps to:
-
Organize documents
-
Summarize content
-
Find patterns
-
Improve search systems
Basic Idea of Topic Modeling
Topic modeling assumes that:
-
Each document contains multiple topics
-
Each topic contains multiple words
Example:
A document about “Cricket World Cup” may have:
Topic 1: Sports
Topic 2: Team
Topic 3: Tournament
So one document can belong to many topics.
Latent Dirichlet Allocation (LDA)
What is LDA
LDA stands for Latent Dirichlet Allocation.
It is the most popular algorithm for topic modeling.
LDA is a probabilistic model that finds topics from text data.
“Latent” means hidden
“Dirichlet” is a probability distribution
“Allocation” means assigning topics
So:
LDA finds hidden topics and assigns them to documents.
Main Idea of LDA
LDA assumes that:
-
Each document is a mixture of topics
-
Each topic is a mixture of words
In simple words:
-
Document = combination of topics
-
Topic = combination of words
Example:
Document: “AI and Data Science”
Topic 1 (AI): model, learning, neural
Topic 2 (Data): data, analysis, mining
Document = 60% Topic 1 + 40% Topic 2
Probabilistic View of LDA
LDA works using probability.
It tries to find:
-
Probability of topic in a document
-
Probability of word in a topic
So it learns:
P(Topic | Document)
P(Word | Topic)
Generative Process of LDA (How LDA Thinks)
LDA assumes that documents are created like this:
Step 1: Choose topic distribution for document
Step 2: Choose a topic from that distribution
Step 3: Choose a word from that topic
Step 4: Repeat for all words
This is called the generative process.
It means LDA imagines how text was generated.
Important Terms in LDA
Document-Topic Distribution (θ)
Shows how much each topic appears in a document.
Example:
Document 1:
-
Topic 1: 0.5
-
Topic 2: 0.3
-
Topic 3: 0.2
Topic-Word Distribution (φ)
Shows how much each word belongs to a topic.
Example:
Topic: Sports
cricket: 0.2
match: 0.15
player: 0.1
Dirichlet Distribution
Used to control topic and word probabilities.
Two parameters:
α (alpha): controls topic distribution
β (beta): controls word distribution
High value → more spread
Low value → more focused
Applications of LDA
LDA is used in:
-
News classification
-
Research paper analysis
-
Recommendation systems
-
Customer feedback analysis
-
Search engines
-
Social media analysis
Advantages of LDA
Advantages:
-
Automatic topic discovery
-
Works on large text data
-
No need for manual labeling
-
Easy to interpret topics
Limitations of LDA
Limitations:
-
Needs good preprocessing
-
Number of topics must be chosen manually
-
Topics may be unclear
-
Not good for very short text
Comparison: Topic Modeling vs Text Classification
Key Differences
| Feature | Topic Modeling (LDA) | Text Classification |
|---|---|---|
| Labels | Not required | Required |
| Type | Unsupervised | Supervised |
| Output | Topics | Classes |
| Purpose | Discover themes | Predict category |